Alea Seifert, Robelyn Felices, Franchesca Garcia, and Libby Brill
Introduction
Data Description and Cleaning
For our final project, we chose to focus on how carbon dioxide (CO\(_{2}\)) emissions impacts life expectancy. We used data sets taken from Gapminder, which have been combined and collected from a variety of sources. The carbon dioxide emissions data set contains 194 observations (countries) with 224 variables (ranging in year from 1800 to 2022). These data represent the recorded consumption-based CO\(_{2}\) emissions, in tonnes of CO\(_{2}\) per capita. The life expectancy data set contains 195 observations (countries) with 302 variables (ranging in year from 1800 to 2100). The life expectancy data was collected from three main sources including 100 sources compiled by Mattias Lindgren from 1800-1970, the Global Burden of Disease Study from the Institute for Health Metrics and Evaluations containing data for 1970 to 2016, and World Population Prospects from the UN forecasts which published interpolated demographics from 2017-2099. These data represent the number of years a newborn infant would live assuming the mortality rate at their birth remains constant throughout their life.
We proceeded by pivoting both data sets into long format, then consolidating them into a singular data set containing variables for country, year, CO\(_{2}\) emissions, and life expectancy.
While pivoting the data, we received some warnings that resulted from R reading in the negative dashes as character values as opposed to numeric, specifically in our CO\(_{2}\) emissions data. To fix this issue, we replaced all dash symbols with the appropriate negative sign. This process was only applied for the CO\(_{2}\) emissions data. The year variable was also converted to numeric. Overall, we cleaned the data to create a usable data set to investigate how how carbon dioxide emissions impacts life expectancy.
Method
We analyzed the relationship between CO\(_{2}\) emissions and life expectancy, using CO\(_{2}\) emissions as the explanatory variable and life expectancy as the response variable. We expected a negative relationship between CO\(_{2}\) emissions and life expectancy, where countries that produce greater amounts of CO\(_{2}\) tend to have lower life expectancy. We assumed CO\(_{2}\) emissions decrease air quality and caused potential respiratory issues, decreasing life expectancy.
We assessed this relationship by using a Simple Linear Regression (SLR) to fit a linear model between our variables averaged across time. By using SLR, we assumed that there is a linear relationship between average CO\(_{2}\) emissions and average life expectancy per country, and we can use that to estimate a country’s average life expectancy based on their average CO\(_{2}\) emissions. This function is of the form: \(\widehat{response} = intercept + slope*explanatory\). That is, we can predict the average life expectancy of a country by multiplying their average CO\(_{2}\) emissions by the model slope and adding the model intercept.
It is important to note that not all assumptions necessary for SLR are not met with our data:
We can assume independence because we are observing average values of life expectancy per country. Each country should generally have its own state of conditions that determine life expectancy, however we should be weary of a possible lack of independence. If we did not average our data, meaning there are multiple observational rows per country, then independence would be violated.
We can assume normality because a histogram of average life expectancy per country is fairly bell shaped. The distribution is slightly right skewed, but our sample size of countries is large (\(n\) = 186) so should still be reliable.
We can assume equal variance because the points of a residuals versus fits based on a SLR model, although shift up and down, do not appear to fan in or out.
However, we cannot assume linearity because there is a slight curve seen in the visualization of the relationship between CO\(_{2}\) emissions and life expectancy in our data (Figure 2), as well as in the linear model’s residuals versus fits plot. Thus, we should proceed with caution when interpreting results produced by a linear regression model.
Results
Data Visualization
Our data immediately contrasted our hypothesis by showing a positive relationship between CO\(_{2}\) emissions and life expectancy, as opposed to a negative relationship like we expected. We investigated how CO\(_{2}\) emissions and life expectancy per country change over time:
Code
# Visualization 1:animated <- project_data |>ggplot(aes(x = co2, y = life_exp)) +geom_point(alpha =0.7, show.legend =FALSE,na.rm =TRUE) +labs(title ="Average CO2 Emissions vs. Life Expectancy per Country in Year: {round(frame_time)}",x ="CO2 Emissions (tonnes per capita)",y ="",subtitle ="Life Expectancy (years)") +transition_time(year) +ease_aes("linear") +theme_bw()animate(animated, renderer =gifski_renderer(), fps =7)
Figure 1 depicts the relationship between the explanatory variable, carbon dioxide emissions per capita in tonnes, and the response variable, life expectancy in years, for each country over years 1800 to 2022. It appears that with each year, both CO\(_{2}\) emissions and life expectancy increase. They likely both increase over time due to technological advancements. CO\(_{2}\) emissions have likely increased burning of fossil fuels from developments like cars. While the effects of CO\(_{2}\) emissions can create an impact on an individuals health, society also still progresses in fields such as health care and advancements in medicine. To further investigate, we analyzed the relationship between the CO\(_{2}\) emissions and life expectancy per country averaged over the years:
Code
# Averaged project datacountry_data <- project_data |>group_by(country) |>summarize(avg_co2 =mean(co2),avg_life_exp =mean(life_exp)) # Visualization 2:country_data |>ggplot(aes(x = avg_co2,y = avg_life_exp)) +geom_jitter(na.rm =TRUE) +geom_smooth(method ="lm", na.rm =TRUE) +theme_bw() +labs(title =expression("Average CO"*""[2]*" Emissions and Life Expectancy per Country"),x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y =NULL,subtitle ="Average Life Expectancy (years)")
Figure 2 depicts the average carbon dioxide emissions per capita in tonnes per country and the average life expectancy per country over the years 1800 to 2022. It appears that there is a positive, weak to moderate somewhat linear relationship between the two variables. There is an apparent increase in life expectancy as CO\(_{2}\) emissions increase with a curved pattern where there is a steep slope, but then a plateau.
This positive relationship can be explained because, as mentioned, over time there have been continuous technological developments that require increased burning of fossil fuels which emit CO\(_{2}\), and many allow access to improved healthcare and sanitation. The latter can contribute to the positive growth in the estimated life expectancy, even if CO\(_{2}\) emissions themselves would have a negative effect. It appears that there are potential confounding variables between these two variables.
Linear Regression Model
To further analyze our data, we fit a simple linear regression model to the data. We then analyzed whether the resulting model was a good predictor of our data:
Which provides the equation: \(\widehat{y} = 41.149 + 1.806x\)
Our predicted response variable (\(\widehat{y}\)) is the estimated average life expectancy per country, based on our explanatory variable (\(x\)), average CO\(_{2}\) emissions in tonnes per capita. The intercept in our simple linear model provided us with the information that when carbon dioxide emissions are zero, the estimated average life expectancy is 41.149 years. The slope term in our model is stating that for every additional one tonne per capita of CO\(_{2}\) emissions, the estimated average life expectancy goes up by 1.806 years, on average.
Model Fit
To assess the fit of our linear regression model, we calculated the variance in the response, fitted values, and residuals. Additionally, we analyzed the proportion of the variability in the response values was accounted for by the regression model:
Code
augment(project_lm) |>summarize(var_response =var(avg_life_exp),var_fit =var(.fitted),var_resid =var(.resid)) |>mutate(var_prop = var_fit/var_response) |> knitr::kable(digits =3,caption ="Figure 4.",col.names =c("Variance in Response","Variance in Fitted","Variance in Residual", "Proportion"),align=c(rep('c',times=4)))
Figure 4.
Variance in Response
Variance in Fitted
Variance in Residual
Proportion
33.054
14.018
19.036
0.424
The table above shows that the 14.018 represents the amount of variability in the average life expectancy that was accounted for by CO\(_{2}\) emissions. The value 19.036 is the amount of variability in life expectancy that has not been accounted for.
Figure 4 also shows 42.4% of variability in life expectancy is explained by our linear regression model using CO\(_{2}\) emissions as a predictor (\(14.018/33.054 = 0.424\)). This is a relatively moderate percentage of variability that can be explained by the model. This means the model is not the best representation of the variability in average life expectancy as it accounts for less than half of the variability.
Simulation
To further investigate whether the linear regression model would be an appropriate descriptor of the data, we simulated data to assess the quality of our linear model by comparing them to the observed data:
# observed plotcountry_data |>ggplot(aes(x = avg_co2,y = avg_life_exp)) +geom_point(na.rm =TRUE) +theme_bw() +labs(title ="Observed Life Expectancy per Country",x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y =NULL,subtitle ="Average Life Expectancy (years)") +ylim(25, 70)# simulated plotsimulation <-as.vector(predict(project_lm) +rnorm(n =186, mean =0, sd =sigma(project_lm)))sim_data <- country_data |>filter(!is.na(avg_life_exp)) |>mutate(sim = simulation)sim_data |>ggplot(mapping =aes(x = avg_co2, y = sim)) +geom_point(na.rm =TRUE) +labs(title ="Simulated Life Expectancy per Country based on Linear Regression Model",x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y ="",subtitle ="Average Life Expectancy (years)" ) +theme_bw() +ylim(25, 70)
In comparing plots of the observed average life expectancy and simulated average life expectancy against CO\(_{2}\) emissions (Figure 5), their directions and strengths are similar, although the simulated values are noticeably more linear in form as they are based on the SLR model and the observed values follow a curved shape. The simulated data consequently also have lower minimum values and higher maximum values. This tells us that we should be hesitant to say that the linear model it a good fit.
We continued by analyzing \(R^{2}\) values to tell us how well the simulated data matched the observed data. It is calculated on a scale from 0 to 1, with \(R^{2}\) values closer to 1 signifying a stronger model:
The \(R^{2}\) values for simulations based on our model histogram appears to have a normal distribution, with a possible right-skew. It appears to be centered around 0.1791 (Figure 6), which is relatively on the low-side. This indicates the data simulated under this statistical model are weakly similar to what was observed. On average, our simulated data accounts for about 17.91% of the variability in the observed life expectancy.
=======
The \(R^{2}\) values for simulations based on our model histogram appears to have a normal distribution, with a possible right-skew. It appears to be centered around 0.1798 (Figure 6), which is relatively on the low-side. This indicates the data simulated under this statistical model are weakly similar to what was observed. On average, our simulated data accounts for about 17.98% of the variability in the observed life expectancy.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f
Conclusion
The \(R^{2}\) value 42.4% of our SLR model (Figure 4) allows us to conclude that, accounting for not even half of the variability, it does not appear to be a particularly strong model of life expectancy.
<<<<<<< HEAD
Similarly, since the \(R^{2}\) value of the observed average life expediencies is relatively small, only about 0.1791, we can conclude that our simulated linear regression model does not match our observed data very well. Thus, it is further supported that our SLR model and does not have very strong predictive ability.
=======
Similarly, since the \(R^{2}\) value of the observed average life expediencies is relatively small, only about 0.1798, we can conclude that our simulated linear regression model does not match our observed data very well. Thus, it is further supported that our SLR model and does not have very strong predictive ability.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f
This is understandable due to the fact that our original linear model assumptions were not met - specifically the condition of linearity. Since our original observed data was not very linear, it follows that a linear model would not be the best fit.
In conclusion we have determined that there is likely not a linear relationship between average CO\(_2\) emissions per capita and average life expectancy. However, there may be a different model that would better represent a relationship.
Source Code
---title: "Stat 331: Final Project"author: "Alea Seifert, Robelyn Felices, Franchesca Garcia, and Libby Brill"embed-resources: trueformat: html: self-contained: true code-tools: true toc: trueeditor: sourcecode-fold: trueexecute: error: true echo: true message: false warning: false---```{r setup}#| echo: false# load in datalibrary(tidyverse)library(gganimate)library(gifski)library(broom)library(kableExtra)library(scales)```## Introduction#### Data Description and CleaningFor our final project, we chose to focus on how carbon dioxide (CO$_{2}$) emissions impacts life expectancy. We used data sets taken from Gapminder, which have been combined and collected from a variety of sources. The carbon dioxide emissions data set contains 194 observations (countries) with 224 variables (ranging in year from 1800 to 2022). These data represent the recorded consumption-based CO$_{2}$ emissions, in tonnes of CO$_{2}$ per capita. The life expectancy data set contains 195 observations (countries) with 302 variables (ranging in year from 1800 to 2100). The life expectancy data was collected from three main sources including 100 sources compiled by Mattias Lindgren from 1800-1970, the Global Burden of Disease Study from the Institute for Health Metrics and Evaluations containing data for 1970 to 2016, and World Population Prospects from the UN forecasts which published interpolated demographics from 2017-2099. These data represent the number of years a newborn infant would live assuming the mortality rate at their birth remains constant throughout their life.```{r data}# Dataco2_emissions <- read_csv(here::here("Datasets", "co2_pcap_cons.csv"))life_expectancy <- read_csv(here::here("Datasets", "lex.csv"))# Pivoting Longerco2_emissions_long <- co2_emissions |> mutate(across(`1800`:`2022`, ~ str_replace_all(.x, "−", "-"))) |> mutate(across(`1800`:`2022`, ~ as.numeric(.x))) |> pivot_longer( cols = `1800`:`2022`, names_to = "year", values_to = "co2" )life_exp_long <- life_expectancy |> mutate(across(`1800`:`2100`, ~ as.numeric(.x))) |> pivot_longer( cols = `1800`:`2100`, names_to = "year", values_to = "life_exp" )# Join Datasetsproject_data <- co2_emissions_long |> inner_join(life_exp_long, by = c("country", "year") ) |> mutate(year = as.numeric(year))```We proceeded by pivoting both data sets into long format, then consolidating them into a singular data set containing variables for country, year, CO$_{2}$ emissions, and life expectancy.While pivoting the data, we received some warnings that resulted from R reading in the negative dashes as character values as opposed to numeric, specifically in our CO$_{2}$ emissions data. To fix this issue, we replaced all dash symbols with the appropriate negative sign. This process was only applied for the CO$_{2}$ emissions data. The year variable was also converted to numeric. Overall, we cleaned the data to create a usable data set to investigate how how carbon dioxide emissions impacts life expectancy.#### MethodWe analyzed the relationship between CO$_{2}$ emissions and life expectancy, using CO$_{2}$ emissions as the explanatory variable and life expectancy as the response variable. We expected a negative relationship between CO$_{2}$ emissions and life expectancy, where countries that produce greater amounts of CO$_{2}$ tend to have lower life expectancy. We assumed CO$_{2}$ emissions decrease air quality and caused potential respiratory issues, decreasing life expectancy.We assessed this relationship by using a Simple Linear Regression (SLR) to fit a linear model between our variables averaged across time. By using SLR, we assumed that there is a linear relationship between average CO$_{2}$ emissions and average life expectancy per country, and we can use that to estimate a country’s average life expectancy based on their average CO$_{2}$ emissions. This function is of the form: $\widehat{response} = intercept + slope*explanatory$. That is, we can predict the average life expectancy of a country by multiplying their average CO$_{2}$ emissions by the model slope and adding the model intercept.It is important to note that not all assumptions necessary for SLR are not met with our data:We can assume independence because we are observing average values of life expectancy per country. Each country should generally have its own state of conditions that determine life expectancy, however we should be weary of a possible lack of independence. If we did not average our data, meaning there are multiple observational rows per country, then independence would be violated.We can assume normality because a histogram of average life expectancy per country is fairly bell shaped. The distribution is slightly right skewed, but our sample size of countries is large ($n$ = 186) so should still be reliable.We can assume equal variance because the points of a residuals versus fits based on a SLR model, although shift up and down, do not appear to fan in or out.However, we cannot assume linearity because there is a slight curve seen in the visualization of the relationship between CO$_{2}$ emissions and life expectancy in our data (Figure 2), as well as in the linear model's residuals versus fits plot. Thus, we should proceed with caution when interpreting results produced by a linear regression model.## Results#### Data VisualizationOur data immediately contrasted our hypothesis by showing a positive relationship between CO$_{2}$ emissions and life expectancy, as opposed to a negative relationship like we expected. We investigated how CO$_{2}$ emissions and life expectancy per country change over time:```{r}#| fig-cap: "Figure 1."#| fig-align: center# Visualization 1:animated <- project_data |>ggplot(aes(x = co2, y = life_exp)) +geom_point(alpha =0.7, show.legend =FALSE,na.rm =TRUE) +labs(title ="Average CO2 Emissions vs. Life Expectancy per Country in Year: {round(frame_time)}",x ="CO2 Emissions (tonnes per capita)",y ="",subtitle ="Life Expectancy (years)") +transition_time(year) +ease_aes("linear") +theme_bw()animate(animated, renderer =gifski_renderer(), fps =7)```Figure 1 depicts the relationship between the explanatory variable, carbon dioxide emissions per capita in tonnes, and the response variable, life expectancy in years, for each country over years 1800 to 2022. It appears that with each year, both CO$_{2}$ emissions and life expectancy increase. They likely both increase over time due to technological advancements. CO$_{2}$ emissions have likely increased burning of fossil fuels from developments like cars. While the effects of CO$_{2}$ emissions can create an impact on an individuals health, society also still progresses in fields such as health care and advancements in medicine. To further investigate, we analyzed the relationship between the CO$_{2}$ emissions and life expectancy per country averaged over the years:```{r}#| fig-cap: "Figure 2."#| fig-align: center# Averaged project datacountry_data <- project_data |>group_by(country) |>summarize(avg_co2 =mean(co2),avg_life_exp =mean(life_exp)) # Visualization 2:country_data |>ggplot(aes(x = avg_co2,y = avg_life_exp)) +geom_jitter(na.rm =TRUE) +geom_smooth(method ="lm", na.rm =TRUE) +theme_bw() +labs(title =expression("Average CO"*""[2]*" Emissions and Life Expectancy per Country"),x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y =NULL,subtitle ="Average Life Expectancy (years)")```Figure 2 depicts the average carbon dioxide emissions per capita in tonnes per country and the average life expectancy per country over the years 1800 to 2022. It appears that there is a positive, weak to moderate somewhat linear relationship between the two variables. There is an apparent increase in life expectancy as CO$_{2}$ emissions increase with a curved pattern where there is a steep slope, but then a plateau.This positive relationship can be explained because, as mentioned, over time there have been continuous technological developments that require increased burning of fossil fuels which emit CO$_{2}$, and many allow access to improved healthcare and sanitation. The latter can contribute to the positive growth in the estimated life expectancy, even if CO$_{2}$ emissions themselves would have a negative effect. It appears that there are potential confounding variables between these two variables.#### Linear Regression ModelTo further analyze our data, we fit a simple linear regression model to the data. We then analyzed whether the resulting model was a good predictor of our data:```{r}#| fig-align: center# Model:project_lm <-lm(avg_life_exp ~ avg_co2,data = country_data)# Coefficients:project_lm |> broom::tidy() |>mutate(p.value = scales::pvalue(p.value)) |> knitr::kable(digits =3,caption ="Figure 3.",col.names =c("Term","Estimate","Standard Error", "Statistic","P-value"),align=c(rep('c',times=6)))```Which provides the equation: $\widehat{y} = 41.149 + 1.806x$Our predicted response variable ($\widehat{y}$) is the estimated average life expectancy per country, based on our explanatory variable ($x$), average CO$_{2}$ emissions in tonnes per capita. The intercept in our simple linear model provided us with the information that when carbon dioxide emissions are zero, the estimated average life expectancy is 41.149 years. The slope term in our model is stating that for every additional one tonne per capita of CO$_{2}$ emissions, the estimated average life expectancy goes up by 1.806 years, on average.#### Model FitTo assess the fit of our linear regression model, we calculated the variance in the response, fitted values, and residuals. Additionally, we analyzed the proportion of the variability in the response values was accounted for by the regression model:```{r}augment(project_lm) |>summarize(var_response =var(avg_life_exp),var_fit =var(.fitted),var_resid =var(.resid)) |>mutate(var_prop = var_fit/var_response) |> knitr::kable(digits =3,caption ="Figure 4.",col.names =c("Variance in Response","Variance in Fitted","Variance in Residual", "Proportion"),align=c(rep('c',times=4)))```The table above shows that the 14.018 represents the amount of variability in the average life expectancy that was accounted for by CO$_{2}$ emissions. The value 19.036 is the amount of variability in life expectancy that has not been accounted for.Figure 4 also shows 42.4% of variability in life expectancy is explained by our linear regression model using CO$_{2}$ emissions as a predictor ($14.018/33.054 = 0.424$). This is a relatively moderate percentage of variability that can be explained by the model. This means the model is not the best representation of the variability in average life expectancy as it accounts for less than half of the variability.#### SimulationTo further investigate whether the linear regression model would be an appropriate descriptor of the data, we simulated data to assess the quality of our linear model by comparing them to the observed data:```{r}#| layout-ncol: 2#| fig-align: center#| fig-cap: "Figure 5."# observed plotcountry_data |>ggplot(aes(x = avg_co2,y = avg_life_exp)) +geom_point(na.rm =TRUE) +theme_bw() +labs(title ="Observed Life Expectancy per Country",x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y =NULL,subtitle ="Average Life Expectancy (years)") +ylim(25, 70)# simulated plotsimulation <-as.vector(predict(project_lm) +rnorm(n =186, mean =0, sd =sigma(project_lm)))sim_data <- country_data |>filter(!is.na(avg_life_exp)) |>mutate(sim = simulation)sim_data |>ggplot(mapping =aes(x = avg_co2, y = sim)) +geom_point(na.rm =TRUE) +labs(title ="Simulated Life Expectancy per Country based on Linear Regression Model",x =expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),y ="",subtitle ="Average Life Expectancy (years)" ) +theme_bw() +ylim(25, 70)```In comparing plots of the observed average life expectancy and simulated average life expectancy against CO$_{2}$ emissions (Figure 5), their directions and strengths are similar, although the simulated values are noticeably more linear in form as they are based on the SLR model and the observed values follow a curved shape. The simulated data consequently also have lower minimum values and higher maximum values. This tells us that we should be hesitant to say that the linear model it a good fit.We continued by analyzing $R^{2}$ values to tell us how well the simulated data matched the observed data. It is calculated on a scale from 0 to 1, with $R^{2}$ values closer to 1 signifying a stronger model:```{r}#| fig-cap: "Figure 6."#| fig-align: centernsims <-1000sims <-map_dfc(.x =1:nsims,.f =~tibble(sim =as.vector(predict(project_lm) +rnorm(n =186,mean =0,sd =sigma(project_lm) ))))colnames(sims) <-colnames(sims) |>str_replace(pattern ="\\.\\.\\.",replace ="_" )sims <- project_data |>group_by(country) |>summarize(avg_life_exp =mean(life_exp)) |>filter(!is.na(avg_life_exp)) |>select(avg_life_exp) |>bind_cols(sims)sims_r <- sims |>map(~lm(avg_life_exp ~ .x,data = sims )) |>map(glance) |>map_dbl(~ .x$r.squared)sims_r <- sims_r[names(sims_r) !="avg_life_exp"]tibble(sims = sims_r) |>ggplot(aes(x = sims)) +geom_histogram(binwidth =0.025) +labs(title =expression("Distribution of Simulated"~ R^2),x =expression("Simulated"~ R^2),y ="",subtitle ="Number of Simulated Models" ) +theme_bw()```The $R^{2}$ values for simulations based on our model histogram appears to have a normal distribution, with a possible right-skew. It appears to be centered around `r round(median(sims_r), 4)` (Figure 6), which is relatively on the low-side. This indicates the data simulated under this statistical model are weakly similar to what was observed. On average, our simulated data accounts for about `r round(median(sims_r), 4) * 100`% of the variability in the observed life expectancy.## ConclusionThe $R^{2}$ value 42.4% of our SLR model (Figure 4) allows us to conclude that, accounting for not even half of the variability, it does not appear to be a particularly strong model of life expectancy.Similarly, since the $R^{2}$ value of the observed average life expediencies is relatively small, only about `r round(median(sims_r), 4)`, we can conclude that our simulated linear regression model does not match our observed data very well. Thus, it is further supported that our SLR model and does not have very strong predictive ability.This is understandable due to the fact that our original linear model assumptions were not met - specifically the condition of linearity. Since our original observed data was not very linear, it follows that a linear model would not be the best fit.In conclusion we have determined that there is likely not a linear relationship between average CO$_2$ emissions per capita and average life expectancy. However, there may be a different model that would better represent a relationship.